Introduction

The following analysis is based on the Wine Quality dataset described in P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Exploring the Dataset

An Overview of the Data

The variables in the Wine Quality dataset are as follows.

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

The types of the variables are as follows.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

The variable X is an index for each observation in the dataset. All the other variables are numeric: the physicochemical measurements are stored as floating-point numbers, and quality is stored as an integer.
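As a minimal sketch of this inspection step, the block below rebuilds the first ten observations of a few columns from the str() output above and drops the index column. The data frame name `wine` and the choice of columns are assumptions made for illustration; the original analysis presumably read the full file instead.

```r
# First ten observations of a few columns, copied from the str() output above.
# The name `wine` is hypothetical; the report itself refers to `dataset`.
wine <- data.frame(
  X = 1:10,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# X only repeats the row number, so it carries no information for the analysis.
stopifnot(identical(wine$X, seq_len(nrow(wine))))
wine$X <- NULL

# Column classes: the measurements are "numeric", quality is "integer".
col_classes <- sapply(wine, class)
print(col_classes)
```

Dropping X early keeps it out of later summaries and correlation matrices, where an index column would only add noise.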

A closer look at the variability in the numerical data is as follows.

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

A boxplot can be used to visualize the variability of each variable, as follows.

## Using  as id variables

The lower and upper whiskers extend to the lowest and highest points that lie within 1.5 times the interquartile range of the lower and upper quartiles. Moreover, a histogram of each variable helps in understanding its distribution, as follows.
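As a brief aside on the whisker rule, the sketch below works it through for residual sugar. The quartiles come from the summary table above; applying the resulting fences to the first ten values from the str() output is only an illustration, not how the plot itself was computed.

```r
# Quartiles of residual.sugar, read off the summary table above.
q1 <- 1.9
q3 <- 2.6
iqr <- q3 - q1  # interquartile range

# Fences: points beyond these are drawn individually as outliers.
lower_fence <- q1 - 1.5 * iqr  # about 0.85
upper_fence <- q3 + 1.5 * iqr  # about 3.65

# First ten residual.sugar values, copied from the str() output above.
x <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# Whiskers stop at the most extreme points that are still inside the fences.
inside <- x >= lower_fence & x <= upper_fence
whisker_ends <- range(x[inside])
outliers <- x[!inside]

print(whisker_ends)
print(outliers)
```

With these fences the value 6.1 falls outside and would be plotted as an individual point, which is why the long upper tail of residual sugar shows up as a column of outliers rather than a long whisker.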

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Most of the variables look approximately normally distributed. Three of the variables, alcohol, sulphates and total sulfur dioxide, appear to have lognormal distributions. Furthermore, the distributions of residual sugar and chlorides are difficult to see because of outliers. In the following, the values above the 95th percentile of residual sugar and chlorides are excluded, and the histograms for both are replotted.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 79 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 80 rows containing non-finite values (stat_bin).

## Warning: Removed 2 rows containing missing values (geom_bar).

After excluding the outliers for both residual sugar and chlorides, the distributions look approximately normal.
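The trimming step can be sketched as follows. The values in `x` are made up to mimic a variable with a few large outliers, and subsetting before plotting is an assumption: the warnings above suggest the original plots restricted the axis range instead, which has the same visual effect.

```r
# Made-up values: 95 ordinary measurements plus 5 large outliers.
x <- c(seq(1, 3, length.out = 95), 6, 8, 10, 12, 15)

# The 95th percentile is the cutoff; everything above it is excluded.
cutoff <- quantile(x, 0.95)
trimmed <- x[x <= cutoff]

# Histogram of the trimmed values (computed only; call plot(h) to draw it).
h <- hist(trimmed, breaks = 30, plot = FALSE)

print(cutoff)
print(length(x) - length(trimmed))  # number of excluded values
```

On the full dataset of 1599 observations, the same rule excludes roughly 5% of the rows, which is consistent with the 79 and 80 removed rows reported in the warnings.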

The statistical summary of residual sugar is as follows.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The statistical summary of chlorides is as follows.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

The main interest is the quality rating and the variables that influence it. The quality ratings are visualized as a histogram, and their summary is as follows.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Most wines have a quality rating of 5 or 6.

Alcohol content is something people take into account when they buy wine, so its distribution is examined next:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The figure shows that alcohol content has a lognormal distribution, with a high peak at the lower end of the alcohol scale.

Bivariate Relationships

The relationship between every pair of variables and their respective Pearson product-moment correlation can be quickly visualized in a plot matrix. The names on the x and y axes of the plot matrix are as follows.

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"

The four highest positive correlation coefficients with quality are as follows.

* alcohol:quality = 0.48
* sulphates:quality = 0.25
* citric.acid:quality = 0.23
* fixed.acidity:quality = 0.12

The four strongest negative correlation coefficients with quality are as follows.

* volatile.acidity:quality = -0.39
* total.sulfur.dioxide:quality = -0.19
* density:quality = -0.17
* chlorides:quality = -0.13

The variable pairs with the strongest (positive or negative) correlations are as follows.

* fixed.acidity:citric.acid = 0.67
* fixed.acidity:density = 0.67
* free.sulfur.dioxide:total.sulfur.dioxide = 0.67
* alcohol:quality = 0.48
* density:alcohol = -0.50
* citric.acid:pH = -0.54
* volatile.acidity:citric.acid = -0.55
* fixed.acidity:pH = -0.68
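Coefficients like the ones listed above can be computed with base R's cor(). The sketch below uses only the first ten observations from the str() output, so its numbers will not match the values reported for the full dataset; the data frame name `wine` and the column selection are assumptions for illustration.

```r
# First ten observations of a few columns, copied from the str() output above.
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  pH = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation between every pair of columns.
r <- cor(wine, method = "pearson")

# Correlation of each variable with quality, strongest positive first.
r_quality <- sort(r[rownames(r) != "quality", "quality"], decreasing = TRUE)

print(round(r, 2))
print(round(r_quality, 2))
```

Sorting the quality column of the matrix gives the ranked lists directly, and the same matrix can be scanned for strongly related predictor pairs such as fixed acidity and pH.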

A closer look at some of these relationships follows, starting with density and alcohol.

As shown, density increases as alcohol content decreases.

Next, fixed acidity and pH.

As shown, fixed acidity increases as pH decreases.

Next, fixed acidity and density.

As shown, fixed acidity increases as density increases.

A closer look at alcohol content by wine quality, using a density plot, is as follows.

The plot shows that wines with a high alcohol content tend to have a high quality rating. Also, wines rated 5 appear to be concentrated at the low end of the alcohol scale.

The statistical summary of alcohol content for every quality level is as follows.

## factor(dataset$quality): 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.575  11.000 
## -------------------------------------------------------- 
## factor(dataset$quality): 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## factor(dataset$quality): 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## factor(dataset$quality): 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## factor(dataset$quality): 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## factor(dataset$quality): 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

Sulphate content also appears to matter for wine quality, in particular at the higher quality levels of 7 and 8.

The statistical summary of sulphates for every quality level is as follows.

## $`3`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5125  0.5450  0.5700  0.6150  0.8600 
## 
## $`4`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4900  0.5600  0.5964  0.6000  2.0000 
## 
## $`5`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.370   0.530   0.580   0.621   0.660   1.980 
## 
## $`6`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5800  0.6400  0.6753  0.7500  1.9500 
## 
## $`7`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7413  0.8300  1.3600 
## 
## $`8`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6300  0.6900  0.7400  0.7678  0.8200  1.1000
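Per-quality summaries like the two above can be produced with base R's by() or tapply(). The headers of the two outputs suggest that one of each was used, but that is an inference. The sketch below uses the first ten observations from the str() output, so only quality levels 5, 6 and 7 appear; the name `wine` is hypothetical.

```r
# First ten observations of three columns, copied from the str() output above.
wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# by() prints one summary per group, separated by dashed lines.
alcohol_by_quality <- by(wine$alcohol, factor(wine$quality), summary)
print(alcohol_by_quality)

# tapply() returns a list with one summary per quality level.
sulphates_by_quality <- tapply(wine$sulphates, wine$quality, summary)
print(sulphates_by_quality)

# A single statistic per group gives a compact table for comparison.
median_alcohol <- tapply(wine$alcohol, wine$quality, median)
print(median_alcohol)
```

Reducing each group to one statistic, such as the median, makes the trend across quality levels easier to read than six full summaries.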

Multivariate Plots

The relationship among sulphates, volatile.acidity, alcohol and quality is visualized as follows.

The plot shows that the high-quality wines are mostly concentrated in the top left part of the plot. High alcohol content, which is represented by large dots, is concentrated in the same area.

The quality is summarized below using a contour plot of alcohol content and sulphates.

The plot shows that the high-quality wines, which are represented by darker contour lines, are mostly located in the top right part of the plot, whereas the low-quality wines are mostly located in the bottom right part.

The quality is visualized below using density plots along the x and y axes, with color indicating quality.

The plot shows that the high-quality wines are mostly located in the top right part of the plot.

Final Plots and Summary

The following parts summarize the main findings with refined plots.

The highest correlation coefficient with quality was that of alcohol. A closer look at alcohol content by wine quality, using a density plot, is as follows.

As shown, the density curves for the high-quality wines, which are drawn in red, are shifted to the right. This means that these wines have a comparatively high alcohol content relative to the low-quality wines. Also, wines rated 5 appear to be concentrated at the low end of the alcohol scale.

The statistical summary of alcohol content for every quality level is as follows.

## dataset$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.575  11.000 
## -------------------------------------------------------- 
## dataset$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## dataset$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## dataset$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## dataset$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## dataset$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

Sulphates were also found to be positively correlated with wine quality (r = 0.25), whereas volatile acidity has a negative correlation with it (r = -0.39). The following scatter plot shows the relationship between sulphates and volatile acidity, together with alcohol content and wine quality.

The plot shows that the high-quality wines are mostly concentrated in the top right part of the plot. The large dots are concentrated in the same area.
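The values 0.25 and -0.39 are Pearson correlation coefficients, which range from -1 to 1. Squaring them gives the coefficient of determination, which cannot be negative. A small worked sketch, using only the two coefficients reported above:

```r
# Pearson correlation coefficients with quality, as reported above.
r <- c(sulphates = 0.25, volatile.acidity = -0.39)

# Squaring gives the share of the variance in quality that a simple
# linear fit on each variable alone would account for.
r_squared <- r^2

print(round(r_squared, 4))
```

Read this way, each variable on its own accounts for only a modest share of the variation in quality, which is one reason to look at several variables together in the plot above.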

The summary of alcohol content by quality rating is as follows.

## dataset$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.575  11.000 
## -------------------------------------------------------- 
## dataset$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## dataset$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## dataset$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## dataset$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## dataset$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

The summary of sulphates by quality rating is as follows.

## dataset$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5125  0.5450  0.5700  0.6150  0.8600 
## -------------------------------------------------------- 
## dataset$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4900  0.5600  0.5964  0.6000  2.0000 
## -------------------------------------------------------- 
## dataset$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.370   0.530   0.580   0.621   0.660   1.980 
## -------------------------------------------------------- 
## dataset$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5800  0.6400  0.6753  0.7500  1.9500 
## -------------------------------------------------------- 
## dataset$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7413  0.8300  1.3600 
## -------------------------------------------------------- 
## dataset$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6300  0.6900  0.7400  0.7678  0.8200  1.1000

The summary of volatile acidity by quality rating is as follows.

## dataset$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## dataset$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## dataset$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## dataset$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## dataset$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## dataset$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500

The following plot shows the relationship between alcohol content and sulphates by combining a scatter plot with density plots.

Reflection

The dataset explored in this exercise contains 1599 observations of different wines with 12 variables, not counting the index X. First, the variables were examined and analyzed. Second, questions were explored by making observations on the plots. Finally, wine quality was analyzed in relation to the other variables.

The analysis in this project considered the relationship between the attributes of the wines and their quality. In addition, melting the data frame and using facet grids were very helpful in visualizing the distribution of every parameter with boxplots and histograms. The majority of the parameters are approximately normally distributed, whereas citric acid, free sulfur dioxide, total sulfur dioxide and alcohol tend toward a lognormal distribution.

Future work could develop a model to analyze and predict wine quality from the same dataset.